A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
Authors
Abstract
Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton’s method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.

1 Markov Decision Processes

Markov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. A MDP is described by the tuple {S, A, H, p_1, p, π, R}, where S and A are sets known respectively as the state and action space, H ∈ N is the planning horizon, which can be either finite or infinite, and {p_1, p, π, R} are functions referred to as the initial state distribution, transition dynamics, policy (or controller) and reward function, respectively. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of R^n, where n ∈ N. We use boldface notation to represent a vector and also use the notation z = (s, a) to denote a state-action pair. Given a MDP, the trajectory of the agent is determined by the following recursive procedure: given the agent’s state, s_t, at a given time-point, t ∈ N_H, an action is selected according to the policy, a_t ∼ π(·|s_t); the agent then transitions to a new state according to the transition dynamics, s_{t+1} ∼ p(·|a_t, s_t); this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution, s_1 ∼ p_1(·). At each time-point the agent receives a (scalar) reward that is determined by the reward function, which depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g. that it is non-negative.

The most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite-horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the discount factor, γ ∈ [0, 1).
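For concreteness, the discounted objective described in this section can be written as follows; the notation here, with a policy parameter vector w and discount factor γ ∈ [0, 1), is a standard assumption rather than something fixed by this excerpt:

\[
U(\mathbf{w}) \;=\; \mathbb{E}\Bigg[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t)\Bigg],
\qquad
s_1 \sim p_1(\cdot), \quad
a_t \sim \pi(\cdot \mid s_t; \mathbf{w}), \quad
s_{t+1} \sim p(\cdot \mid a_t, s_t).
\]

Because the reward function is bounded and non-negative, the geometric weights γ^{t-1} keep this sum finite, which is the boundedness property noted at the end of the paragraph above.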
Similar resources
Approximate Newton Methods for Policy Search in Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analy...
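For reference (this equation is not part of the cited abstract), the update that such methods build on is the Newton step on an objective U(w), where α_k is a step size; approximate Newton and Gauss-Newton methods replace the exact Hessian by a cheaper, more easily inverted surrogate Ĥ:

\[
\mathbf{w}_{k+1} \;=\; \mathbf{w}_k \;-\; \alpha_k\, \hat{H}(\mathbf{w}_k)^{-1}\, \nabla_{\mathbf{w}} U(\mathbf{w}_k),
\qquad
\hat{H}(\mathbf{w}) \approx \nabla^2_{\mathbf{w}} U(\mathbf{w}).
\]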
Non-parametric Policy Search with Limited Information Loss
Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem, by adapting to the complexity of the data set. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value f...
A Gauss-Newton Method for Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, whilst alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first ana...
A Unifying Framework for Temporal Abstraction in Stochastic Processes
This paper presents a framework for unifying the large and growing body of literature that deals with what broadly can be defined as temporal abstraction in Markov Decision Processes (MDPs). MDPs provide an appealing formal framework for modeling a large variety of stochastic problems. The main drawback of this approach is that a requirement of the formal model, i.e., the Markov property, typic...
Policy search in kernel Hilbert space
Much recent work in reinforcement learning and stochastic optimal control has focused on algorithms that search directly through a space of policies rather than building approximate value functions. Policy search has numerous advantages: it does not rely on the Markov assumption, domain knowledge may be encoded in a policy, the policy may require less representational power than a value-functio...